From the plot, the points can first be grouped by proximity: points separated by the shortest distances are clustered together. This gives the grouping shown in the image below:
Initial Cluster Grouping
Based on this grouping, 5 clusters are produced and then used for the hierarchical clustering. The process of manually solving the hierarchical clustering is given in the images below:
Hierarchy cluster part 1
Hierarchy cluster part 2
First I calculated the shortest distance between each pair of clusters. The pair with the shortest distance between them was then merged, and the process was repeated until only clusters 1 and 2-3-4-5 remained. This was how the groups S-T and A-B-C…-Q-R were obtained.
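The merge loop described above can be sketched in a few lines of Python. This is only an illustration: the coordinates are hypothetical stand-ins (the real points are in the images), and measuring the gap between two clusters as the smallest point-to-point distance (single linkage) is an assumption, since the text does not name the linkage rule used.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b):
    """Smallest point-to-point distance between clusters a and b (assumed rule)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(clusters, stop_at=2):
    """Merge the closest pair of clusters until `stop_at` clusters remain.

    `clusters` maps a label to a list of (x, y) points.
    Returns the remaining clusters and the list of merges performed.
    """
    clusters = {label: list(pts) for label, pts in clusters.items()}
    history = []
    while len(clusters) > stop_at:
        # Find the pair of cluster labels with the smallest gap.
        a, b = sorted(min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        ))
        gap = single_linkage(clusters[a], clusters[b])
        clusters[f"{a}-{b}"] = clusters.pop(a) + clusters.pop(b)
        history.append((a, b, gap))
    return clusters, history


# Hypothetical coordinates, one point per starting cluster, for illustration only.
toy = {
    "1": [(0.0, 0.0)],
    "2": [(5.0, 5.0)],
    "3": [(5.0, 6.0)],
    "4": [(7.0, 5.0)],
    "5": [(8.0, 8.0)],
}
final, merges = agglomerate(toy)
for a, b, gap in merges:
    print(f"merge {a} + {b} at distance {gap:.3f}")
```

The order of the recorded merges is what a dendrogram draws from bottom to top, with the merge distance as the height of each join.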
The final dendrogram tree generated is given below:
Hierarchy cluster Tree
This task was done manually, starting from the reference clusters that were given. The cluster points are as follows:
| Cluster | Point |
|---|---|
| C1 | A |
| C2 | C |
| C3 | F |
| C4 | M |
From this table, the manual calculation was done as follows:
For each point (B, D, E, G, … L, N, …, T), the Euclidean distance from the point to each cluster's reference point was calculated. The point was then assigned to the cluster with the shortest distance. This was repeated until all the points were distributed among the 4 clusters. The initial calculations of the Euclidean distances are shown below:
KMean cluster page 1
KMean cluster page 2
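The assignment step worked through by hand above can also be sketched in code. The coordinates below are hypothetical placeholders (the real values are in the scanned pages), and the function names are made up for this illustration.

```python
from math import dist


def distance_table(points, centroids):
    """Euclidean distance from every point to every cluster reference point."""
    return {
        name: {label: dist(xy, c) for label, c in centroids.items()}
        for name, xy in points.items()
    }


def assign_to_nearest(points, centroids):
    """Give each point the label of its closest reference point."""
    table = distance_table(points, centroids)
    return {name: min(row, key=row.get) for name, row in table.items()}


# Hypothetical reference points and points to assign, for illustration only.
centroids = {"C1": (3.0, 5.0), "C2": (1.0, 2.0), "C3": (4.0, 2.0), "C4": (6.0, 2.0)}
points = {"p1": (3.0, 6.0), "p2": (0.5, 1.0), "p3": (4.5, 2.5), "p4": (7.0, 3.0)}

assignment = assign_to_nearest(points, centroids)
print(assignment)
```

Keeping the full distance table, rather than only the winning label, mirrors the hand calculation and makes it easy to see how close the runner-up cluster was.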
After this was done, the following clusters were obtained:
| Cluster | Points | Mean Point |
|---|---|---|
| C1 | A D N P | (3.5, 5.39) |
| C2 | C O S T | (1.1, 2.35) |
| C3 | B F I J | (3.78, 2.25) |
| C4 | E G H K L M R Q | (6.095, 2.315) |
To verify the assignment, the distance from each point to each of the new centroids was then calculated. This produced the table shown below:
KMean Distance verification
The first iteration of Euclidean distances shows that some points are closer to cluster C1 than to the clusters they were initially placed in. These points are highlighted in orange above. The points are then moved to cluster C1, and the resulting K-means clusters are given below:
| Cluster | Points | Mean Point |
|---|---|---|
| C1 | A C D F M N P | (3.325, 5.3875) |
| C2 | S T | (0.65, 1.35) |
| C3 | B I J | (3.67, 1.83) |
| C4 | E G H K L R Q | (6.55, 2.5) |
After this cluster table is produced, the distances from the points to the cluster centroids are calculated again. This gives the table in the image below:
KMean Distance 2nd iteration
In this second iteration, every point already falls into its nearest cluster, so no further iteration is needed. The final cluster table is as follows:
| Cluster | Points | Mean Point |
|---|---|---|
| C1 | A C D F M N P | (3.325, 5.3875) |
| C2 | S T | (0.65, 1.35) |
| C3 | B I J | (3.67, 1.83) |
| C4 | E G H K L R Q | (6.55, 2.5) |
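The whole procedure (assign, recompute the mean point, check whether any point moved, repeat) can be sketched as a small loop. The five points below are hypothetical and chosen only so that one point changes cluster after the first update; they are not the A to T data from the task.

```python
from math import dist


def mean_point(pts):
    """Mean of a non-empty list of (x, y) points."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)


def kmeans(points, centroids, max_iter=100):
    """Repeat assignment and mean update until no point changes cluster.

    points: name -> (x, y); centroids: label -> (x, y) starting positions.
    Returns the assignment, the final centroids, and the number of passes.
    """
    assignment = None
    for iteration in range(1, max_iter + 1):
        new_assignment = {
            name: min(centroids, key=lambda c: dist(xy, centroids[c]))
            for name, xy in points.items()
        }
        if new_assignment == assignment:
            return assignment, centroids, iteration  # nothing moved: converged
        assignment = new_assignment
        updated = {}
        for label, old in centroids.items():
            members = [points[n] for n, c in assignment.items() if c == label]
            # An empty cluster keeps its previous centroid.
            updated[label] = mean_point(members) if members else old
        centroids = updated
    return assignment, centroids, max_iter


# Hypothetical data: p3 starts in C2 and moves to C1 after the first update.
points = {"p1": (0, 0), "p2": (1, 0), "p3": (4, 0), "p4": (9, 0), "p5": (10, 0)}
seeds = {"C1": (0, 0), "C2": (4, 0)}

assignment, centroids, passes = kmeans(points, seeds)
print(assignment, centroids, passes)
```

The loop stops on the first pass in which the assignment is identical to the previous one, which is the same stopping rule applied by hand above.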
For this task, I performed a K-means cluster analysis on the data, varying the number of clusters from 30 to 100. I eventually settled on 65 clusters, since this gave a result that was not perfect but still usable. The output for 65 clusters is below:
KMean cluster
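A sweep over the number of clusters can be sketched as follows. This is an assumption-laden illustration: I judged the results visually, so using the within-cluster sum of squared distances (often called inertia) as the comparison score is swapped in here as one common option, and the six points are hypothetical toy data rather than the real dataset.

```python
import random
from math import dist


def kmeans_inertia(points, k, seed=0, max_iter=100):
    """Run a basic K-means with k clusters and return the inertia.

    Starting centroids are k points sampled from the data with a fixed seed.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


# Hypothetical toy data; on the real data the sweep would be range(30, 101).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
inertia_by_k = {k: kmeans_inertia(toy, k) for k in range(1, len(toy) + 1)}
for k, score in inertia_by_k.items():
    print(k, round(score, 3))
```

Inertia generally shrinks as k grows and reaches zero when every point is its own cluster, so it is usually read by looking for the point where further increases in k stop helping much, not by picking the minimum.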
The data analysis techniques used in the analysis are as follows:
PostgreSQL and PostGIS: PostgreSQL was used as the data store for the full 267 GB dataset, while PostGIS was used to perform the geographic calculations.
Data Table: This was also used to represent the impact of snowfall and rainfall on taxi ride trends.
The project idea is an analysis of the Student Performance Dataset. The dataset to be used can be found here: Student Dataset.
The goals of the project are listed below:
We intend to try out the classification techniques previously taught in this course, and we hope to get some interesting results. We will also visualize the data using different visualization techniques until we find one that is satisfactory.
We cannot always predict what we will get before performing a data analysis; however, we hope to achieve the following:
These goals and expected results will guide us on where to focus our research and what to achieve.
A team of four is required, with an estimated completion time of three weeks. We already have the four members of the group and will start the project as soon as possible.